Regulation of mature mRNA levels by RNA processing efficiency

Abstract Transcription and co-transcriptional processes, including pre-mRNA splicing and mRNA cleavage and polyadenylation, regulate the production of mature mRNAs. The carboxyl terminal domain (CTD) of RNA polymerase (pol) II, which comprises 52 repeats of the Tyr1Ser2Pro3Thr4Ser5Pro6Ser7 peptide, is involved in the coordination of transcription with co-transcriptional processes. The pol II CTD is dynamically modified by protein phosphorylation, which regulates recruitment of transcription and co-transcriptional factors. We have investigated whether mature mRNA levels from intron-containing protein-coding genes are related to pol II CTD phosphorylation, RNA stability, and pre-mRNA splicing and mRNA cleavage and polyadenylation efficiency. We find that genes that produce a low level of mature mRNAs are associated with relatively high phosphorylation of the pol II CTD Thr4 residue, poor RNA processing, increased chromatin association of transcripts, and shorter RNA half-life. While these poorly-processed transcripts are degraded by the nuclear RNA exosome, our results indicate that in addition to RNA half-life, chromatin association due to a low RNA processing efficiency also plays an important role in the regulation of mature mRNA levels.


INTRODUCTION
Transcription of human protein-coding genes by RNA polymerase (pol) II is a highly complex process requiring the coordination of multiple proteins. In addition to the transcription cycle, composed of transcription initiation, pol II pausing and release, transcription elongation, and transcription termination, co-transcriptional processes, including capping, splicing, cleavage and polyadenylation, and co-transcriptional loading of mRNA export factors, are required for the production of mature mRNA (1, 2). A key element regulating the crosstalk between transcription and co-transcriptional processes is the carboxyl terminal domain (CTD) of the large subunit of pol II, which comprises 52 repeats of the heptapeptide Tyr1-Ser2-Pro3-Thr4-Ser5-Pro6-Ser7. The pol II CTD can be modified by several post-translational modifications (PTMs), including protein phosphorylation, methylation, acetylation, and proline isomerization (3, 4). Phosphorylation of the pol II CTD is one of the major PTMs and can occur on five residues, Tyr1, Ser2, Thr4, Ser5 and Ser7 (3, 4). Kinases, including several cyclin-dependent kinases (CDKs), and phosphatases regulate the CTD phosphorylation pattern and level across the transcription cycle (3, 4).
Phospho-Ser5 (Ser5P) and phospho-Ser2 (Ser2P) are the most studied modifications and are found at the promoter region and in the gene body / downstream of the poly(A) site, respectively (4). Ser5P is associated with the recruitment of the mRNA capping complex and pre-mRNA splicing factors while Ser2P is linked to the recruitment of elongation factors and proteins of the mRNA cleavage and polyadenylation complex (CPA). The roles of the three other residues, Tyr1, Thr4, and Ser7, are less well understood (5). On protein-coding genes, phosphorylation of Tyr1 is present at promoter and transcription termination regions. Tyr1P has been found to be higher on antisense promoters (PROMPTs) and enhancers compared to protein-coding genes (6, 7) and to increase following DNA double-strand breaks and UV irradiation (8, 9). In addition, mutation of three quarters of the tyrosine residues to alanine promotes transcriptional readthrough and a loss of the Mediator and Integrator complexes from pol II. Phosphorylation of Thr4 is found at the 3' end of protein-coding genes, indicating a potential role in transcription termination (10, 11). In addition, Thr4P has been found to be higher on the gene bodies of long non-coding (lnc)RNAs, which are known to be prone to premature transcription termination (PTT) (10, 11). Mutation of Thr4 residues to alanine results in a transcription elongation defect on protein-coding genes and a 3' end-processing defect of histone gene transcripts (10, 12). Phosphorylation of Ser7 is currently the least understood but follows a similar pattern to Ser2P. The combination of Ser2P and Ser7P has been shown to be involved in the recruitment of the Integrator complex to small nuclear (sn)RNA genes (13). Mutation of Ser7 residues to alanine results in decreased transcription of snRNA genes and decreased 3' processing of their transcripts, while protein-coding genes do not seem to be affected (13, 14).
Modification of the pol II CTD is therefore critical for coordinating co-transcriptional processes during transcription. In turn, co-transcriptional processes can affect transcription (2). A major example is the coupling between pre-mRNA splicing, mRNA cleavage and polyadenylation, and transcription termination. It has been shown via long-read sequencing approaches that protein-coding gene transcripts that are poorly processed are associated with pol II transcriptional readthrough, likely due to a failure to recognize the poly(A) site (15, 16). Additionally, transcriptional readthrough can be promoted by knockdown of CPA factors, cellular stresses, or viral infection (17-19).
To better understand how transcription and co-transcriptional processes regulate the production of mature mRNAs from intron-containing protein-coding genes (termed protein-coding genes in the rest of the manuscript), we took advantage of genome-wide data available for HeLa cells. We find that pol II on protein-coding genes producing relatively low levels of mature mRNAs is hyperphosphorylated on the CTD Thr4, and to a lesser extent Tyr1, residues. Interestingly, the reduced production of mature mRNAs from these protein-coding genes is mediated by poor RNA processing, which results in a combination of chromatin association of the transcripts and degradation by the nuclear RNA exosome of the poorly-processed transcripts that are released from chromatin.
Our results indicate that the regulation of RNA processing efficiency plays an important role in controlling gene expression through chromatin association of transcripts and degradation by the nuclear RNA exosome of chromatin-released transcripts. This RNA processing efficiency-dependent regulation of mRNA levels is shared between long non-coding (lnc)RNAs and protein-coding genes, indicating a general regulatory mechanism.

Genome-wide datasets
The genome-wide data used in this study are summarized in Table 1.

mNET-seq analysis
Adapters were trimmed with Cutadapt version 1.18 in paired-end mode with the following options: --minimum-length 10 -q 15,10 -j 16 -A GATCGTCGGACTGTAGAACTCTGAAC -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC. Trimmed reads were mapped to the human GRCh38.p13 reference sequence with STAR version 2.7.3a and the parameters: --runThreadN 16 --readFilesCommand gunzip -c -k --limitBAMsortRAM 20000000000 --outSAMtype BAM SortedByCoordinate. SAMtools version 1.9 was used to retain the properly paired and mapped reads (-f 3). A custom Python script (40) was used to obtain the 3' nucleotide of the second read and the strandedness of the first read. Strand-specific BAM files were generated with SAMtools. FPKM-normalized bigWig files were created with the deepTools2 bamCoverage tool with the parameters -bs 1 -p max --normalizeUsing RPKM.
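The single-nucleotide step described above can be sketched as follows. This is only an illustration, not the script from (40): the Alignment type and the function name are hypothetical, coordinates are assumed to be 0-based and half-open, and "3' nucleotide" is interpreted here as the last aligned base of read 2 in the read's own orientation.

```python
from typing import NamedTuple, Tuple


class Alignment(NamedTuple):
    """Hypothetical minimal record for one aligned read (0-based, half-open)."""
    chrom: str
    start: int        # leftmost aligned base
    end: int          # one past the rightmost aligned base
    is_reverse: bool  # True if the read aligned to the minus strand


def three_prime_position(read1: Alignment, read2: Alignment) -> Tuple[str, int, str]:
    """Return (chrom, position, strand) for one properly paired fragment.

    Position: 3' nucleotide of read 2 (assumed interpretation, see lead-in).
    Strand: taken from the orientation of read 1.
    """
    # The 3' end of a read is its rightmost base when aligned to the plus
    # strand and its leftmost base when aligned to the minus strand.
    position = read2.start if read2.is_reverse else read2.end - 1
    strand = "-" if read1.is_reverse else "+"
    return read2.chrom, position, strand
```

In practice the alignments would come from a BAM parser; plain tuples are used here so that the coordinate logic can be checked in isolation.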

ChIP-seq and mNuc-seq analysis
Adapters were trimmed with Cutadapt version 1.18 in paired-end mode with the following options: --minimum-length 10 -q 15,10 -j 16 -A GATCGTCGGACTGTAGAACTCTGAAC -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC. Trimmed reads were mapped to the human GRCh38.p13 reference sequence with STAR version 2.7.3a and the parameters: --runThreadN 16 --readFilesCommand gunzip -c -k --limitBAMsortRAM 20000000000 --outSAMtype BAM SortedByCoordinate. SAMtools version 1.9 was used to retain the properly paired and mapped reads (-f 3) and to remove PCR duplicates. Reads mapping to the DAC Exclusion List Regions (accession: ENCSR636HFF) were removed with BEDtools version 2.29.2 (41). FPKM-normalized bigWig files were created with the deepTools2 bamCoverage tool with the parameters -bs 10 -p max -e --normalizeUsing RPKM.
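For reference, per-bin RPKM normalization as used for the coverage tracks can be written out as a short function. This is a sketch of the standard formula (reads per bin divided by mapped reads in millions and by bin length in kilobases); the function and parameter names are illustrative and not part of deepTools.

```python
def rpkm_per_bin(reads_in_bin: float, total_mapped_reads: int, bin_size_bp: int) -> float:
    """Reads per kilobase per million mapped reads for a single coverage bin."""
    if total_mapped_reads <= 0 or bin_size_bp <= 0:
        raise ValueError("total_mapped_reads and bin_size_bp must be positive")
    # reads / (mapped reads / 1e6) / (bin size / 1e3), rearranged so that
    # the division happens once.
    return reads_in_bin * 1e9 / (total_mapped_reads * bin_size_bp)
```

With a 10 bp bin (-bs 10), 50 reads in the bin and 20 million mapped reads, this gives 50 / (20 x 0.01) = 250.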

Proteomic analysis
The proteome data of HeLa cells were obtained from (42).
To determine whether some of the protein-coding genes could be misannotated lncRNAs, we used the global human proteome data from the Human PeptideAtlas version 2023-01 database (43). The list of genes found in the RNA-seq data was compared to the list of proteins found in the global human proteomic data to keep only protein-coding genes that are supported experimentally by proteomic data.

Differential expression analysis
For differential expression analysis, the number of aligned reads per gene was obtained with the STAR --quantMode GeneCounts option during the mapping of raw reads to the human genome or with HTSeq version 1.99.2 (44). The lists of differentially expressed genes were obtained with DESeq2 version 1.30.1 (45) and apeglm version 1.18.0 (46), keeping only the genes with a fold change < -2 or > 2 and an adjusted p-value below 0.05. The comparisons between POINT-seq, chromatin RNA-seq, and nucleoplasm RNA-seq were performed by quantifying aligned reads only on exons, as the co-transcriptional pre-mRNA splicing rates, and therefore the number of reads mapped on introns, differ between the three techniques (see below).

Human gene annotation and selection of subset of genes
Gencode V38 annotation, which is based on the hg38 version of the human genome, was used to obtain the list of all protein-coding genes. Intronless and histone genes were removed to obtain intron-containing protein-coding genes.
For each gene, we kept the annotation (TSS and poly(A) site) of the highest expressed transcript isoform, which was obtained with Salmon version 1.2.1 on four HeLa chromatin RNA-seq experiments. Only transcripts that are expressed (TPM > 0) in at least three of the four biological replicates were retained. The lists of similarly expressed nucleoplasm-enriched genes, chromatin-enriched genes, and non-enriched genes were generated through iterative random subsampling to achieve subsets of 500 genes with the most similar expression level and distribution in the chromatin RNA-seq. For the total pol II mNET-seq subsets, a comparable subsampling was performed but, rather than using 500 genes, we used 10% of similarly expressed genes from each category, as we could not obtain a non-significant difference with 500 genes due to the difference in nascent transcription level between chromatin-enriched and nucleoplasm-enriched genes. The set of lncRNAs was obtained from (11) using the long intergenic non-coding (linc)RNAs and the antisense transcripts. As the lncRNA annotation was originally from the hg19 version of the human genome, we overlapped the hg19 annotation with the hg38 annotation and obtained a list of 632 lncRNAs.
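The iterative random subsampling can be illustrated with a minimal sketch. The scoring function used here (spread of subset medians plus spread of subset standard deviations) is an assumption made for the example, as are the function name and the default number of iterations; the text above only states that subsets with the most similar expression level and distribution were selected.

```python
import random
import statistics
from typing import Dict, List, Tuple

Expression = Dict[str, float]  # gene name -> expression value


def matched_subsets(
    groups: Dict[str, Expression],
    size: int,
    iterations: int = 1000,
    seed: int = 0,
) -> Tuple[Dict[str, List[str]], float]:
    """Pick one random subset per group so that subset expression is similar.

    Score (assumed for this sketch): spread of the subset medians plus
    spread of the subset standard deviations. Lower is better.
    """
    rng = random.Random(seed)
    best_subsets: Dict[str, List[str]] = {}
    best_score = float("inf")
    for _ in range(iterations):
        # Draw `size` genes without replacement from every group.
        draw = {
            name: rng.sample(sorted(expression), size)
            for name, expression in groups.items()
        }
        medians = []
        spreads = []
        for name, genes in draw.items():
            values = [groups[name][gene] for gene in genes]
            medians.append(statistics.median(values))
            spreads.append(statistics.pstdev(values))
        score = (max(medians) - min(medians)) + (max(spreads) - min(spreads))
        # Keep the draw whose groups look most alike so far.
        if score < best_score:
            best_subsets, best_score = draw, score
    return best_subsets, best_score
```

A statistical test between the subsets (for example, requiring a non-significant difference, as mentioned for the mNET-seq subsets) could replace the score without changing the loop structure.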

Splicing efficiency
The splicing efficiency on POINT-seq and RNA-seq was calculated by first parsing each BAM file to obtain the list of spliced and unspliced reads with the awk command (awk '/^@/ || $6 ~ /N/' for spliced reads and awk '/^@/ || $6 !~ /N/' for unspliced reads). The splicing efficiency was then calculated as the number of spliced reads over total reads with BEDtools multicov -s -split. The splicing efficiency of each transcript was then normalised to the number of exons. Significant changes in splicing events following knockdown of the nuclear RNA exosome were obtained with rMATS version 4.1.2 with the options: bam file input, paired-end mode (47).
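The calculation above can be sketched for a single transcript as follows. As in the awk filter, a read counts as spliced when its CIGAR string contains an N operation. Dividing by the exon count is one plausible reading of "normalised to the number of exons" and is an assumption of this example, as are the function names.

```python
from typing import Iterable


def is_spliced(cigar: str) -> bool:
    """A read is spliced if its CIGAR contains a skipped region (N)."""
    return "N" in cigar


def splicing_efficiency(cigars: Iterable[str], n_exons: int) -> float:
    """Spliced reads over total reads, divided by the number of exons."""
    cigar_list = list(cigars)
    if not cigar_list:
        raise ValueError("no reads overlap this transcript")
    if n_exons < 1:
        raise ValueError("n_exons must be at least 1")
    spliced = sum(is_spliced(c) for c in cigar_list)
    # Assumed normalisation: plain division by the exon count.
    return (spliced / len(cigar_list)) / n_exons
```

For a two-exon transcript covered by four reads with CIGAR strings 50M, 20M100N30M, 50M and 10M5N40M, two of the four reads are spliced, so the normalised efficiency is 0.5 / 2 = 0.25.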

Transcription termination index
The transcription termination index is defined as in ( 40 ) and we termed it readthrough index in this paper:

Metaprofiles, boxplots, and violin plots
Metaprofiles were generated from the matrix output of the deepTools2 computeMatrix tool, run in scale-regions mode with the following parameters: -bs 10 -p max -m 4000 -b 2500 -a 2500, with the --maxThreshold 2500 parameter added for mNET-seq analysis. The metaprofiles, boxplots, and violin plots were generated with R version 4.0.5 using the ggplot2, ggpubr and gridExtra packages. Quantifications have been performed across the gene body, TSS to TES. ChIP-seq and mNuc-seq metaprofiles are shown as IP / Input signal.

Statistical tests
The statistical tests are indicated in the figure legends and were performed with R version 4.0.5.

A subset of expressed protein-coding genes produces a low amount of mature mRNA and protein
Previous studies have shown that a subset of lincRNAs, named lincRNA-like protein-coding genes, are similar to mRNAs in that they undergo RNA processing and produce stable nuclear RNA. In addition, a limited number of protein-coding gene transcripts have been found to have features common to lincRNAs, including high chromatin reads relative to transcription levels and RNA exosome sensitivity (11, 48). However, the RNA exosome is only one of the protein complexes regulating RNA production. We have therefore more widely investigated the amount of mature mRNAs produced from intron-containing protein-coding genes in relation to transcription and chromatin association of transcripts. We have used chromatin and nucleoplasm RNA-seq data available from HeLa cells to identify the protein-coding genes whose transcripts are enriched in the nucleoplasm compared to the chromatin (nucleoplasm-enriched, log2(fold change) > 1, adjusted P-value < 0.05) or that are enriched on the chromatin compared to the nucleoplasm (chromatin-enriched, log2(fold change) < -1, adjusted P-value < 0.05) (Figure 1A). As the proportion of intronic reads differs between chromatin RNA-seq and nucleoplasm RNA-seq (see below), we focused our analysis on exonic reads. From the DESeq2 analysis, we initially found 4803 nucleoplasm-enriched protein-coding genes and 6130 chromatin-enriched protein-coding genes. We removed histone and intronless genes to keep intron-containing protein-coding genes, and also removed genes that were not found to be expressed in the nucleoplasm RNA-seq. We then kept for each remaining gene the annotation (TSS and poly(A) site) of the most expressed transcript in the nucleoplasm RNA-seq to obtain a list of 4686 transcripts from nucleoplasm-enriched genes and 4119 transcripts from chromatin-enriched genes (Figure 1B).
For comparison purposes, we have also generated a set of 4140 genes encoding non-enriched transcripts (no significant difference between nucleoplasm RNA-seq and chromatin RNA-seq), defined as for the chromatin- and nucleoplasm-enriched categories. As a control, we also used a set of 632 lncRNAs, containing lincRNAs and antisense transcripts from (11), which show high POINT-seq and chromatin RNA-seq signals, driven by highly expressed lncRNAs such as MALAT1 and NEAT1, but a limited nucleoplasm RNA-seq signal, located between those of genes encoding chromatin-enriched and non-enriched transcripts. These results agree with the expected degradation of lncRNAs by the nuclear RNA exosome (Supplementary Figure S1A) (11, 48). To determine whether transcripts from the nucleoplasm-enriched and chromatin-enriched genes are exported to the cytoplasm, we re-analysed HeLa cytoplasmic RNA-seq data, which show efficient cytoplasmic export of transcripts from nucleoplasm-enriched genes while mRNAs from chromatin-enriched genes have a limited presence in the cytoplasm (Figure 1C). To determine whether lower nucleoplasm and cytoplasm RNA-seq signals also result in lower protein production, we compared our RNA-seq results with a previously-published re-analysis of HeLa proteome datasets (42). We could only match ∼5000 genes between the RNA-seq and proteome data but found that fewer of the chromatin-enriched transcripts produce proteins compared to nucleoplasm-enriched transcripts, and that chromatin-enriched transcripts produce fewer peptides of each protein (Figure 1D). As only 448 chromatin-enriched transcripts were found to produce proteins in the HeLa proteome datasets, we checked whether genes encoding chromatin-enriched transcripts could be incorrectly annotated lncRNAs.
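The three categories defined above can be expressed as a small decision rule on the DESeq2 output (nucleoplasm versus chromatin RNA-seq). This is an illustrative sketch: the function name is hypothetical, and the "unclassified" label for genes with a significant but small change is an assumption, since the text does not assign such genes to a category.

```python
def classify_gene(
    log2_fold_change: float,
    adjusted_p: float,
    lfc_cutoff: float = 1.0,
    alpha: float = 0.05,
) -> str:
    """Assign an enrichment category from nucleoplasm vs chromatin RNA-seq."""
    if adjusted_p >= alpha:
        # No significant difference between the two fractions.
        return "non-enriched"
    if log2_fold_change > lfc_cutoff:
        return "nucleoplasm-enriched"
    if log2_fold_change < -lfc_cutoff:
        return "chromatin-enriched"
    # Significant but below the fold-change cutoff (assumed handling).
    return "unclassified"
```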
We compared the RNA-seq results with a database of 2299 proteomic experiments from the Human PeptideAtlas (43) version 2023-01 database and found that 3442 chromatin-enriched transcripts (out of 4119 transcripts, 83.6%) produce peptides (Supplementary Figure S1B). In contrast, 4583 nucleoplasm-enriched transcripts (out of 4686 transcripts, 97.8%) and 4088 non-enriched transcripts (out of 4138 transcripts, 98.8%) produce proteins (Supplementary Figure S1B). For the set of 632 lncRNAs from (11), we found only one lncRNA with peptide support from the Human PeptideAtlas, while only nine genes annotated as lncRNA in the Gencode V38 annotation have peptide support. These results demonstrate that most genes encoding chromatin-enriched transcripts are transcriptionally active and have the potential to produce proteins, with ∼16% of these genes being potentially incorrectly annotated lncRNAs. For the subsequent analyses, we removed all the chromatin-enriched, nucleoplasm-enriched, and non-enriched transcripts that were not experimentally supported in the Human PeptideAtlas version 2023-01 database.
As genes encoding chromatin-enriched transcripts are transcriptionally active but produce only a small amount of proteins, we investigated whether the transcripts produced by the chromatin-enriched genes are more prone to degradation. We compared our RNA-seq results with a previously-published HeLa mRNA half-life dataset (49), which provides half-life data for the transcripts of 1446 chromatin-enriched, 3691 non-enriched, and 4070 nucleoplasm-enriched genes (Figure 1E and F). We found a clear trend where mRNAs from chromatin-enriched genes have on average shorter half-lives compared to transcripts from non-enriched genes and nucleoplasm-enriched genes, with the latter having on average the longest half-lives.
To determine whether the chromatin enrichment of transcripts is explained only by rapid degradation by the nuclear RNA exosome, we re-analysed HeLa POINT-seq data (16), which capture nascent RNA transcription and co-transcriptional splicing (Figure 1G and H). POINT-seq profiles of genes encoding chromatin-enriched, non-enriched, and nucleoplasm-enriched transcripts were similar to the profiles obtained from chromatin RNA-seq (Figure 1B). Chromatin RNA-seq data represent a combination of nascent transcription and of transcripts associated with chromatin, while POINT-seq data are a measure of nascent transcription, i.e. transcripts associated with pol II. We therefore reasoned that a comparison of POINT-seq and chromatin RNA-seq data over exons, which avoids technical differences in co-transcriptional splicing efficiency (see below), could indicate increased chromatin association if the chromatin RNA-seq signal is enriched compared to the POINT-seq signal (Figure 1I). We found that chromatin-enriched transcripts are more prone to chromatin association (blue points with a high log2(fold change)) compared to non-enriched (red) and nucleoplasm-enriched (purple) transcripts. As chromatin-enriched genes are on average longer than non-enriched and nucleoplasm-enriched genes (Supplementary Figure S1C), we investigated whether the higher chromatin association of transcripts from chromatin-enriched genes is due to their longer size (Supplementary Figure S1D). We observed a moderate positive correlation (R = 0.32, P < 2.2e-16), indicating that gene length only partially explains the higher chromatin retention of transcripts from chromatin-enriched genes. In contrast, for the set of 632 lncRNAs there is no clear chromatin association, indicating that these lncRNAs are released from chromatin following transcription (Supplementary Figure S1D).
These findings indicate that in addition to RNA stability, the production of mature mRNAs can also be regulated by chromatin association of transcripts.
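The comparison underlying the chromatin-association measure can be illustrated with a minimal per-gene calculation: the log2 ratio of exonic chromatin RNA-seq signal to exonic POINT-seq signal, where a positive value indicates more chromatin-associated than nascent signal. This is only a sketch of the idea; the function name and the pseudocount are assumptions of the example, and normalized expression values are assumed as input.

```python
import math


def chromatin_association(
    chromatin_exonic: float,
    point_exonic: float,
    pseudocount: float = 1.0,
) -> float:
    """log2(chromatin RNA-seq / POINT-seq) over exons for one gene.

    The pseudocount (assumed here) avoids division by zero for genes
    with no signal in one of the two datasets.
    """
    if chromatin_exonic < 0 or point_exonic < 0:
        raise ValueError("expression values must be non-negative")
    return math.log2((chromatin_exonic + pseudocount) / (point_exonic + pseudocount))
```

For example, a gene with an exonic chromatin RNA-seq value of 7 and a POINT-seq value of 1 gets log2(8 / 2) = 2.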

Higher level of pol II Thr4 phosphorylation might be associated with poor expression and chromatin enrichment of transcripts
Transcription of lncRNAs is associated with a different pol II CTD phosphorylation pattern and with poor co-transcriptional RNA processing, including defective pre-mRNA splicing and mRNA CPA (11). We therefore investigated whether the pol II CTD phosphorylation patterns and/or levels also differ between nucleoplasm-enriched and chromatin-enriched genes. We re-analysed HeLa mNET-seq data for total pol II and the different CTD phosphorylation marks, using Empigen-treated mNET-seq datasets when available as these identify bona fide mNET-seq signals without non-nascent RNA associated with pol II (Figure 2A and Supplementary Figure S1E). While the total pol II profiles follow the expected mNET-seq pattern for the three groups of genes, we observed a higher total pol II level on nucleoplasm-enriched genes compared to chromatin-enriched genes (Figure 2B and Supplementary Figure S1F and S1G). We therefore investigated whether nucleoplasm-enriched genes are more expressed (i.e. more pol II transcribing the genes) and/or pol II is slower, which would also result in a higher pol II signal. To differentiate between these possibilities, we compared the pol II elongation rate data obtained for 1398 protein-coding genes in HeLa cells (50) to the nucleoplasm-enrichment ratio, which corresponds to log2(fold change of nucleoplasm RNA-seq versus chromatin RNA-seq) (Supplementary Figure S1H). We did not observe any correlation between the pol II elongation rate and nucleoplasm enrichment, indicating that nucleoplasm-enriched genes are not associated with slower pol II elongation and that chromatin-enriched genes are also not transcribed faster. The higher signals of total pol II, but also of POINT-seq and chromatin RNA-seq (Figure 1B and H), on nucleoplasm-enriched genes compared to chromatin-enriched genes are therefore likely explained by a higher transcriptional level.
The different pol II CTD phosphorylation profiles follow the expected mNET-seq pattern for the three groups of genes, with a higher signal for all CTD phosphorylation marks on nucleoplasm-enriched genes compared to chromatin-enriched genes (Figure 2B) (4, 11). For the set of 632 lncRNAs, the mNET-seq patterns are also in agreement with what was previously published (Supplementary Figure S2A) (11). As total pol II levels differ between the three groups of genes, we ratioed each CTD phosphorylation signal to total pol II to determine whether the CTD phosphorylation levels are similar between the three groups of genes (Figure 2C). We found that the nucleoplasm-enriched genes generally have a less phosphorylated CTD than the chromatin-enriched genes or non-enriched genes. In contrast, Tyr1 and Thr4 phosphorylation levels are higher in the gene body of the chromatin-enriched genes compared to the non-enriched genes or the nucleoplasm-enriched genes. For lncRNAs, we found a high Tyr1 phosphorylation level while the serine residues are less phosphorylated (Supplementary Figure S2B). We note however that the ratio approach is limited by potential differences in epitope accessibility and a lack of spike-in controls. Single gene examples of a nucleoplasm-enriched gene (PKM), a chromatin-enriched gene (NFIA), and a lncRNA (LINC01409) are shown in Figure 2D.
As Tyr1P and Thr4P levels are associated with a lower level of nascent transcription, we investigated whether the higher relative Tyr1P and Thr4P we observed for chromatin-enriched genes could be due to generally lower expression of these genes rather than a chromatin enrichment-specific CTD phosphorylation pattern (Figure 2B and Supplementary Figure S1F and G). To correct for differences in expression, we selected for each group, via an iterative random-subsampling approach, a subset of 500 genes with the most similar expression level and distribution in chromatin RNA-seq (see Materials and Methods, Figure 3A-C, Supplementary Figure S2C and D). Re-analysis of the CTD phosphorylation mNET-seq ratioed to total pol II on these three subsets of 500 genes indicates that the nucleoplasm-enriched genes have less Ser2, Thr4, Ser5, and Ser7 phosphorylation while the chromatin-enriched genes still have higher Thr4P and, to a lesser extent, Tyr1P (Figure 3D). To confirm the results, we also selected 10% of genes from each group, via an iterative random-subsampling approach, to obtain a similar nascent expression level and distribution from total pol II mNET-seq (see Methods, Figure 3E-G). Re-analysis of the CTD phosphorylation mNET-seq ratioed to total pol II on these three subsets of genes indicates that the nucleoplasm-enriched genes no longer have less Ser2, Thr4, Ser5 and Ser7 phosphorylation while the chromatin-enriched genes still have higher Tyr1 and Thr4 phosphorylation (Figure 3H). The subsampling performed with total pol II mNET-seq results in the selection of chromatin-enriched genes that are amongst the most expressed (compare y-axis of Figure 3E, average value around 5, to Supplementary Figure S1F, average value around 0.1).
As hyperphosphorylation of the pol II CTD Thr4 residues is observed when the mark is either unratioed or ratioed to pol II for this subset of highly expressed chromatin-enriched genes, the higher Thr4 phosphorylation of chromatin-enriched genes is not simply due to generally lower expression but could be a feature of this category of genes. We also investigated why the lower CTD phosphorylation to pol II ratio observed on nucleoplasm-enriched genes disappeared with the total pol II mNET-seq subsampling. Unlike the full gene set and chromatin RNA-seq subset of nucleoplasm-enriched genes, the total pol II mNET-seq subset of genes is on average not shorter than the chromatin-enriched and the non-enriched genes (Supplementary Figure S1C and Supplementary Figure S3A and S3B). However, we could not find any correlation between CTD phosphorylation ratioed to pol II and gene length (Supplementary Figure S3C), indicating another reason behind the disappearance. Comparison of the nucleoplasm-enrichment ratio with relative CTD phosphorylation reveals that the most nucleoplasm-enriched genes have lower phosphorylation levels for Ser2, Ser5 and Ser7 ratioed to total pol II signal (see distribution of data points for the genes with a nucleoplasm-enrichment ratio above 4 in Supplementary Figure S3D). We therefore plotted the distribution of the nucleoplasm-enrichment ratios for the different subsets (Supplementary Figure S3E). The total pol II mNET-seq subset of nucleoplasm-enriched genes shows a specific decrease in the average nucleoplasm-enrichment ratio compared to the total genes and the chromatin RNA-seq subset genes, likely explaining the disappearance of the lower CTD phosphorylation levels ratioed to total pol II.
To confirm the observations made in HeLa cells, we also re-analysed chromatin RNA-seq and total RNA-seq from Raji cells (51) (Supplementary Figure S4A and B). Re-analysis of Raji pol II CTD datasets (6, 10) on the groups of Raji total-enriched, non-enriched, and chromatin-enriched genes also indicates that there is higher Tyr1 and Thr4 phosphorylation on chromatin-enriched genes while total-enriched genes have less Tyr1, Ser2 and Thr4 phosphorylation (Supplementary Figure S4C and D). Comparison of the genes expressed in both HeLa and Raji shows that 47-63% of the nucleoplasm / total-enriched genes and 19-44% of chromatin-enriched genes are common between both cell lines (Supplementary Figure S4E). In addition, ∼160 genes were found to be in opposite categories between the two cell lines (chromatin-enriched to nucleoplasm / total-enriched or vice-versa).
These findings indicate that higher phosphorylation of Thr4, and to a lesser extent Tyr1, could be markers of chromatin-enriched genes while the nucleoplasm-enriched genes with the highest nucleoplasm-enrichment ratio are rather associated with lower CTD phosphorylation levels.

Transcripts from chromatin-enriched genes are poorly processed
As pol II CTD phosphorylation is associated with co-transcriptional processes and we observed differences in CTD phosphorylation levels between nucleoplasm-enriched and chromatin-enriched genes, we investigated whether RNA processing efficiency, including pre-mRNA splicing and mRNA CPA, also differs between the two groups of genes. For pre-mRNA processing, we re-analysed HeLa POINT-seq data, which capture nascent RNA transcription and co-transcriptional splicing (Figure 1G) (16). We calculated co-transcriptional splicing efficiency as the ratio of spliced reads over total reads across each intron-containing protein-coding transcript. As expected, we observed a correlation between the number of exons and our measure of splicing efficiency (Supplementary Figure S5A), which agrees with previous observations that gene length positively correlates with co-transcriptional splicing efficiency (52). As the distribution of the number of exons per gene differs between the chromatin-enriched, nucleoplasm-enriched, and non-enriched genes (Supplementary Figure S5B), we normalized the splicing efficiency of each transcript to its number of exons (Figure 4A and B). We find for the three datasets (POINT-seq, chromatin RNA-seq, and nucleoplasm RNA-seq) that the chromatin-enriched gene transcripts have the lowest splicing efficiency while the nucleoplasm-enriched gene transcripts have the highest splicing efficiency. Importantly, while chromatin-enriched genes are on average longer than non-enriched and nucleoplasm-enriched genes (Supplementary Figure S1C), we observed a lower splicing efficiency on the transcripts from chromatin-enriched genes. As expected, lncRNAs are associated with a poor splicing efficiency on POINT-seq, chromatin RNA-seq, and nucleoplasm RNA-seq (Supplementary Figure S5C) (11). We confirmed the HeLa results with the Raji chromatin RNA-seq and total RNA-seq datasets (Supplementary Figure S5D).
We show as examples PKM and NFIA, a nucleoplasm-enriched and a chromatin-enriched gene, respectively, and the lncRNA LINC01409, with co-transcriptional splicing events indicated by a star (Figure 4C). While co-transcriptional splicing is visible on both NFIA and PKM, NFIA has stronger intronic signals compared to PKM, especially on the chromatin RNA-seq and nucleoplasm RNA-seq. As expected, the lncRNA shows poor co-transcriptional splicing. We also investigated whether there is a more general correlation between splicing efficiency in the POINT-seq or RNA-seq data and the production of mature mRNAs (nucleoplasm-enrichment ratio) (Figure 4D). We found that co-transcriptional splicing efficiency of nascent RNA does not correlate as well as the splicing efficiency of chromatin and nucleoplasm RNA-seq with the nucleoplasm-enrichment ratio. As co-transcriptional splicing is associated with deposition of trimethylation on histone H3 lysine 36 (H3K36me3) by SETD2 (53), we re-analysed HeLa mNuc-seq datasets for H3K36me3 (54) (Supplementary Figure S5E and S5F). In line with less efficient co-transcriptional splicing, chromatin-enriched genes also have lower H3K36me3 across the gene body.
As poor co-transcriptional splicing is also associated with transcriptional readthrough due to failure to recognise the poly(A) site (15, 16), we analysed HeLa ChIP-seq of three CPA factors, CPSF73, PCF11, and Xrn2 (Figure S6C and S6D) (11, 40). The chromatin-enriched genes and lncRNAs are less sensitive to the loss of CPSF73 compared to non-enriched and nucleoplasm-enriched genes, which have more transcriptional readthrough following siCPSF73 treatment.
These findings demonstrate that even though chromatin-enriched gene transcripts are on average longer, these transcripts are poorly processed, which could explain their higher chromatin association.

Transcripts from chromatin-enriched genes are sensitive to the nuclear RNA exosome
The nuclear RNA exosome complex promotes RNA degradation, for example of pre-mRNAs with processing defects, such as those with retained introns or transcriptional readthrough (55). We investigated whether the nuclear RNA exosome complex could degrade the transcripts from chromatin-enriched genes, which are poorly processed. We re-analysed previously published HeLa nucleoplasm RNA-seq from cells treated with siLuc or with an siRNA targeting EXOSC3 (siEXOSC3, abbreviated siEX3), a core component of the nuclear RNA exosome (Supplementary Figure S7A) (11). In total, 551 transcripts, including 518 from protein-coding genes, were down-regulated and 1926 transcripts, including 427 from protein-coding genes, were up-regulated after depletion of the RNA exosome. Comparison of siEX3 up-regulated genes (siEX3(+)) with chromatin-enriched or nucleoplasm-enriched genes shows only a moderate correlation (R = -0.29, P < 2.2e-16) (Supplementary Figure S7B). This initial analysis of the nuclear RNA exosome knockdown shows a limited effect on the mRNA levels based on exons. To determine whether an effect is observed on intron retention, a splicing defect targeted by the nuclear RNA exosome (55), we used rMATS on the chromatin and nucleoplasm RNA-seq before and after siEXOSC3 to obtain the list of significant splicing changes, including alternative 5' and 3' splice sites, mutually exclusive exons, retained introns, and skipped exons (Figure 6A and Supplementary Figure S7C). We found a specific increase in intron retention cases in nucleoplasm RNA-seq following siEXOSC3, indicating that these poorly-processed transcripts are usually degraded by the nuclear RNA exosome.
To determine whether transcripts targeted by the nuclear RNA exosome come from chromatin-enriched, non-enriched, or nucleoplasm-enriched genes, we compared the changes in expression across the whole transcript units (exons and introns) before and after treatment with siEXOSC3 (Figure 6B and Supplementary Figure S7D and S7E). We found that the expression of transcripts from chromatin-enriched genes increases more after siEXOSC3 than that of transcripts from non-enriched and nucleoplasm-enriched genes. To confirm that the transcripts from chromatin-enriched genes that increase are also poorly processed, we calculated the splicing efficiency before and after siEXOSC3 (Figure 6C and Supplementary Figure S7F). Importantly, we found that the splicing efficiency of transcripts from chromatin-enriched genes is the most decreased following knockdown of the nuclear RNA exosome, which indicates an increase of poorly-processed transcripts from these genes in the chromatin and nucleoplasm. The increase in poorly-spliced transcripts from chromatin-enriched genes after siEXOSC3 can be observed on single gene examples, NFIA and MDM4, while no obvious changes in RNA splicing are visible for the transcripts of the nucleoplasm-enriched genes PKM and PSAP (Figure 6D). For the lncRNA LINC01409, we observed the expected increase in nucleoplasm RNA-seq following siEXOSC3 (Figure 6D).
These findings indicate that poorly-processed transcripts from chromatin-enriched genes that are released in the nucleoplasm are degraded by the nuclear RNA exosome.

DISCUSSION
Production of mature mRNA requires both transcription and co/post-transcriptional RNA processing. Regulation of gene expression via the control of transcription initiation and pol II pause release is well established (56). We show here that the efficiency of RNA processing is also an important factor controlling the chromatin association of transcripts, potentially because of higher R-loop levels (57), and the degradation in the nucleoplasm of poorly-processed transcripts by the nuclear RNA exosome (55, 58). A large subgroup of protein-coding genes is transcribed, but the transcripts are poorly processed and chromatin-associated, which results in poor production of mature mRNAs and proteins. While we observe an enrichment of chromatin-enriched transcripts in the chromatin RNA-seq data compared to the POINT-seq data, the comparison does not provide a direct demonstration of chromatin association of the transcripts. Novel experimental approaches will be needed to properly quantify the extent of chromatin retention of transcripts. However, we found that this subset of chromatin-enriched genes shares transcriptional and co-transcriptional similarities with lncRNA genes (11). These include higher CTD Thr4 phosphorylation, poor pre-mRNA splicing and CPA, higher transcriptional readthrough, decreased sensitivity to CPSF73 KD, and degradation of the poorly-processed transcripts in the nucleoplasm by the nuclear RNA exosome (Figure 7). Together these observations explain the low levels of mature mRNAs from these genes and indicate that the cellular mechanisms that regulate levels of lncRNAs are also used to regulate expression of protein-coding genes. Of interest, Schlackow et al. (11) also found a small subset of lncRNA genes whose transcripts are processed efficiently and not retained on the chromatin. These observations indicate that there is overlap between protein-coding genes and lncRNA genes in terms of the mechanisms operating at the transcriptional and co-transcriptional levels. The efficiency of transcription and co-transcriptional processes across transcription units, including protein-coding and non-coding genes, can be viewed as a continuum with poorly-expressed and poorly-processed lncRNAs at one end and highly-expressed and efficiently processed mRNAs at the other end, with some overlap in the middle. Some of the chromatin-enriched genes are well transcribed but produce hardly any mature mRNAs and proteins, which raises the question: what is the cellular advantage of transcribing a protein-coding gene without producing a protein? It is possible that these genes are transcribed but the transcripts are poorly processed until their proteins are required, which would require only the activation of RNA processing. Our data indicate that the downregulation of these genes occurs via poor co-transcriptional RNA processing, chromatin association of the transcripts, and degradation of the transcripts by the nuclear RNA exosome rather than low transcription.

Figure 7. RNA processing efficiency regulates mature mRNA level via a combination of chromatin association and nuclear RNA degradation. The high production of mature mRNAs of nucleoplasm-enriched genes is associated with more efficient pre-mRNA splicing and mRNA CPA, resulting in the production of more stable mRNA that will be exported to the cytoplasm to be translated. In contrast, transcripts from chromatin-enriched genes are associated with higher phosphorylation of pol II CTD Thr4 residues, less efficient pre-mRNA splicing and mRNA CPA, chromatin association of the poorly processed transcripts, and a shorter mRNA half-life due to degradation by the nuclear RNA exosome of the poorly processed transcripts that are located in the nucleoplasm.

In addition, overlap between the HeLa and Raji datasets shows a higher proportion of common genes between nucleoplasm-enriched genes (47-63%) compared to chromatin-enriched genes (19-44%), which indicates a higher diversity in genes that are transcribed but produce poorly-processed transcripts, at least between these two cancer cell lines. Interestingly, while we found only ∼160 genes that are in opposing categories (chromatin-enriched in one cell line and nucleoplasm-enriched in the other, or vice-versa) between the two cell lines, this indicates that genes could potentially move from one category to another depending on the cell line or following a cellular stress, for example.
A surprising observation is the lower CTD phosphorylation level on the nucleoplasm-enriched genes, especially the genes with the highest nucleoplasm-enrichment ratios. As pol II CTD phosphorylation is known to recruit splicing proteins and mRNA CPA factors (59), it is unexpected that lower Ser2P and Ser5P levels are associated with better RNA processing. Slow pol II elongation has been shown to result in hyperphosphorylation of the CTD Ser2 residues at the 5' end of genes, promoting a higher dwell time at start sites and a reduced transcriptional polarity (60). In addition, hyperphosphorylation of the pol II CTD during M phase inhibits pol II, which contributes to mitotic gene silencing (61, 62). While more work is required, these observations suggest that the level of pol II CTD phosphorylation could play an important role in controlling transcription activity and co-transcriptional processing efficiency. However, the decrease in CTD phosphorylation for the nucleoplasm-enriched genes or the higher Tyr1 and Thr4 phosphorylation for chromatin-enriched genes was generally observed only after normalisation to total pol II. The measurement of pol II CTD phosphorylation using antibody-based techniques has several potential pitfalls, including different affinities of each antibody, the influence of other pol II CTD modifications on the antibody specificity, and CTD-interacting proteins that can influence antibody accessibility.
We previously found that inhibition of the protein phosphatase PP2A causes a higher production of poly(A)+ mRNA without any significant changes in transcription level (34), and this is likely due to more efficient cleavage and polyadenylation. Modulation of RNA processing efficiency can therefore regulate gene expression at several different levels.

DATA AVAILABILITY
The public data sources used in this study are described in the Materials and Methods section. Code and data to reproduce results and figures are available on Zenodo: https://doi.org/10.5281/zenodo.7933151.

SUPPLEMENTARY DATA
Supplementary Data are available at NARGAB Online.